Sub-story detection in Twitter with hierarchical Dirichlet processes
Authors
Abstract
Social media has now become the de facto information source on real-world events. The challenge, however, due to the high volume and velocity of social media streams, is how to follow all posts pertaining to a given event over time, a task referred to as story detection. Moreover, there are often several different stories pertaining to a given event, which we refer to as sub-stories; the corresponding task of detecting them automatically is referred to as sub-story detection. This paper proposes hierarchical Dirichlet processes (HDP), a probabilistic topic model, as an effective method for automatic sub-story detection. HDP can learn sub-topics associated with sub-stories, which enables it to handle subtle variations in sub-stories. It is compared with state-of-the-art story detection approaches based on locality sensitive hashing and spectral clustering. We demonstrate the superior performance of HDP for sub-story detection on real-world Twitter data sets using various evaluation measures. The ability of HDP to learn sub-topics helps it to recall the sub-stories with high precision. Another contribution of this paper is in demonstrating that the conversational structures within the Twitter stream can be used to improve sub-story detection performance significantly.
Keywords: sub-story detection, hierarchical Dirichlet process, spectral clustering, locality sensitive hashing
Similar resources
Dependent Hierarchical Normalized Random Measures for Dynamic Topic Modeling
We develop dependent hierarchical normalized random measures and apply them to dynamic topic modeling. The dependency arises via superposition, subsampling and point transition on the underlying Poisson processes of these measures. The measures used include normalised generalised Gamma processes that demonstrate power law properties, unlike Dirichlet processes used previously in dynamic topic m...
Twitter-Network Topic Model: A Full Bayesian Treatment for Social Network and Text Modeling
Twitter data is extremely noisy – each tweet is short, unstructured and with informal language, a challenge for current topic modeling. On the other hand, tweets are accompanied by extra information such as authorship, hashtags and the user-follower network. Exploiting this additional information, we propose the Twitter-Network (TN) topic model to jointly model the text and the social network i...
Streaming First Story Detection with application to Twitter
With the recent rise in popularity and size of social media, there is a growing need for systems that can extract useful information from this amount of data. We address the problem of detecting new events from a stream of Twitter posts. To make event detection feasible on web-scale corpora, we present an algorithm based on locality-sensitive hashing which is able to overcome the limitations of tr...
Enabling Hierarchical Dirichlet Processes to Work Better for Short Texts at Large Scale
Analyzing texts from social media often encounters many challenges, including shortness, dynamics, and huge size. Short texts do not provide enough information, so statistical models often fail to work. In this paper, we present a very simple approach (namely, bag-of-biterms) that helps statistical models such as Hierarchical Dirichlet Processes (HDP) to work well with short texts. By using b...
Time-Varying Topic Models using Dependent Dirichlet Processes
We lay the ground for extending Dirichlet process based clustering and factor models to explicitly include variability as a function of time (or other known covariates) by integrating Dependent Dirichlet Processes into existing hierarchical topic models. Time-Varying Topic Models using Dependent Dirichlet Processes Nathan Srebro Sam Roweis Dept. of Computer Science, University of Toronto, C...
Journal title:
- Inf. Process. Manage.
Volume 53, issue
Pages -
Publication date 2017